A project from the medical domain. The dataset, created by Max Little of the University of Oxford in collaboration with the National Centre for Voice and Speech, Denver, Colorado, is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease.
Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and the risk of dementia is increased. Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive diagnostic tool. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.
The dataset is extracted from the paper: 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering Online 2007, 6:23 (23 June, 2007)
This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.
The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording. There are around six recordings per patient; the patient is identified in the first column.
The columns are as follows:
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
NHR, HNR - Two measures of ratio of noise to tonal components in the voice
status - Health status of the subject (one) - Parkinson's, (zero) - healthy
RPDE, D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation.
The goal is to classify the patients into the respective labels using the attributes from their voice recordings.
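Because each patient contributes several recordings (around six), a naive row-wise train/test split can place recordings from the same patient in both sets and overstate accuracy. Below is a minimal sketch of a patient-level hold-out, assuming patient IDs can be recovered from the "name" column; the names, counts, and the `patient` column here are illustrative, not from the original notebook:

```python
import pandas as pd

# Illustrative recording names in the dataset's naming scheme
# (phon_R01_S01_1, ...): the patient ID is everything before the last "_".
names = [f"phon_R01_S{p:02d}_{r}" for p in range(1, 5) for r in range(1, 4)]
df = pd.DataFrame({"name": names, "status": [1] * 9 + [0] * 3})

# Derive a per-patient group label from the recording name.
df["patient"] = df["name"].str.rsplit("_", n=1).str[0]

# Hold out whole patients rather than individual recordings.
test_patients = {"phon_R01_S04"}
train = df[~df["patient"].isin(test_patients)]
test = df[df["patient"].isin(test_patients)]

# No patient appears in both sets, so no identity leaks across the split.
print(set(train["patient"]).isdisjoint(set(test["patient"])))  # True
```

In practice the same `patient` column can drive scikit-learn's `GroupShuffleSplit` or `GroupKFold` instead of a hand-picked hold-out set.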
import numpy as np
import pandas as pd
import seaborn as sns
from os import system
import itertools
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from IPython.display import Image
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
from sklearn import tree
import warnings
warnings.filterwarnings('ignore')
parkinson_data = pd.read_csv('Data - Parkinsons')
One of the biggest challenges in this dataset is understanding clearly what each attribute means. The attributes are heavily laden with medical terminology, which makes them quite difficult to interpret without sufficient domain knowledge.
parkinson_data.shape
(195, 24)
The two-dimensional dataframe parkinson_data consists of 195 rows and 24 columns.
parkinson_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              195 non-null    object
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  status            195 non-null    int64
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64
dtypes: float64(22), int64(1), object(1)
memory usage: 36.7+ KB
All attributes apart from name contain numerical values.
parkinson_data.isnull().sum()
name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64
None of the columns have null values.
parkinson_data.apply(lambda x: len(x.unique()))
name                195
MDVP:Fo(Hz)         195
MDVP:Fhi(Hz)        195
MDVP:Flo(Hz)        195
MDVP:Jitter(%)      173
MDVP:Jitter(Abs)     19
MDVP:RAP            155
MDVP:PPQ            165
Jitter:DDP          180
MDVP:Shimmer        188
MDVP:Shimmer(dB)    149
Shimmer:APQ3        184
Shimmer:APQ5        189
MDVP:APQ            189
Shimmer:DDA         189
NHR                 185
HNR                 195
status                2
RPDE                195
DFA                 195
spread1             195
spread2             194
D2                  195
PPE                 195
dtype: int64
As noted in the data description above, apart from the attribute status, every attribute has continuous values.
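The unique-value counts confirm that status is binary. Its class balance can be tabulated with value_counts; the sketch below uses a synthetic stand-in built from the summary statistics reported further down (a mean of 0.753846 over 195 rows implies 147 PD and 48 healthy recordings):

```python
import pandas as pd

# Synthetic stand-in for the status column: 147 PD (1) and 48 healthy (0).
status = pd.Series([1] * 147 + [0] * 48, name="status")

counts = status.value_counts()
print(counts.loc[1], counts.loc[0])  # 147 48
print(round(status.mean(), 6))       # 0.753846
```

On the real data the same call is simply parkinson_data['status'].value_counts(); the roughly 3:1 imbalance is worth keeping in mind when judging classifier accuracy later.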
As the attribute name is not useful as a feature for this analysis, we can set it as the index.
parkinson_data = parkinson_data.set_index('name')
parkinson_data.head().T
| name | phon_R01_S01_1 | phon_R01_S01_2 | phon_R01_S01_3 | phon_R01_S01_4 | phon_R01_S01_5 |
|---|---|---|---|---|---|
| MDVP:Fo(Hz) | 119.992000 | 122.400000 | 116.682000 | 116.676000 | 116.014000 |
| MDVP:Fhi(Hz) | 157.302000 | 148.650000 | 131.111000 | 137.871000 | 141.781000 |
| MDVP:Flo(Hz) | 74.997000 | 113.819000 | 111.555000 | 111.366000 | 110.655000 |
| MDVP:Jitter(%) | 0.007840 | 0.009680 | 0.010500 | 0.009970 | 0.012840 |
| MDVP:Jitter(Abs) | 0.000070 | 0.000080 | 0.000090 | 0.000090 | 0.000110 |
| MDVP:RAP | 0.003700 | 0.004650 | 0.005440 | 0.005020 | 0.006550 |
| MDVP:PPQ | 0.005540 | 0.006960 | 0.007810 | 0.006980 | 0.009080 |
| Jitter:DDP | 0.011090 | 0.013940 | 0.016330 | 0.015050 | 0.019660 |
| MDVP:Shimmer | 0.043740 | 0.061340 | 0.052330 | 0.054920 | 0.064250 |
| MDVP:Shimmer(dB) | 0.426000 | 0.626000 | 0.482000 | 0.517000 | 0.584000 |
| Shimmer:APQ3 | 0.021820 | 0.031340 | 0.027570 | 0.029240 | 0.034900 |
| Shimmer:APQ5 | 0.031300 | 0.045180 | 0.038580 | 0.040050 | 0.048250 |
| MDVP:APQ | 0.029710 | 0.043680 | 0.035900 | 0.037720 | 0.044650 |
| Shimmer:DDA | 0.065450 | 0.094030 | 0.082700 | 0.087710 | 0.104700 |
| NHR | 0.022110 | 0.019290 | 0.013090 | 0.013530 | 0.017670 |
| HNR | 21.033000 | 19.085000 | 20.651000 | 20.644000 | 19.649000 |
| status | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| RPDE | 0.414783 | 0.458359 | 0.429895 | 0.434969 | 0.417356 |
| DFA | 0.815285 | 0.819521 | 0.825288 | 0.819235 | 0.823484 |
| spread1 | -4.813031 | -4.075192 | -4.443179 | -4.117501 | -3.747787 |
| spread2 | 0.266482 | 0.335590 | 0.311173 | 0.334147 | 0.234513 |
| D2 | 2.301442 | 2.486855 | 2.342259 | 2.405554 | 2.332180 |
| PPE | 0.284654 | 0.368674 | 0.332634 | 0.368975 | 0.410335 |
parkinson_data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MDVP:Fo(Hz) | 195.0 | 154.228641 | 41.390065 | 88.333000 | 117.572000 | 148.790000 | 182.769000 | 260.105000 |
| MDVP:Fhi(Hz) | 195.0 | 197.104918 | 91.491548 | 102.145000 | 134.862500 | 175.829000 | 224.205500 | 592.030000 |
| MDVP:Flo(Hz) | 195.0 | 116.324631 | 43.521413 | 65.476000 | 84.291000 | 104.315000 | 140.018500 | 239.170000 |
| MDVP:Jitter(%) | 195.0 | 0.006220 | 0.004848 | 0.001680 | 0.003460 | 0.004940 | 0.007365 | 0.033160 |
| MDVP:Jitter(Abs) | 195.0 | 0.000044 | 0.000035 | 0.000007 | 0.000020 | 0.000030 | 0.000060 | 0.000260 |
| MDVP:RAP | 195.0 | 0.003306 | 0.002968 | 0.000680 | 0.001660 | 0.002500 | 0.003835 | 0.021440 |
| MDVP:PPQ | 195.0 | 0.003446 | 0.002759 | 0.000920 | 0.001860 | 0.002690 | 0.003955 | 0.019580 |
| Jitter:DDP | 195.0 | 0.009920 | 0.008903 | 0.002040 | 0.004985 | 0.007490 | 0.011505 | 0.064330 |
| MDVP:Shimmer | 195.0 | 0.029709 | 0.018857 | 0.009540 | 0.016505 | 0.022970 | 0.037885 | 0.119080 |
| MDVP:Shimmer(dB) | 195.0 | 0.282251 | 0.194877 | 0.085000 | 0.148500 | 0.221000 | 0.350000 | 1.302000 |
| Shimmer:APQ3 | 195.0 | 0.015664 | 0.010153 | 0.004550 | 0.008245 | 0.012790 | 0.020265 | 0.056470 |
| Shimmer:APQ5 | 195.0 | 0.017878 | 0.012024 | 0.005700 | 0.009580 | 0.013470 | 0.022380 | 0.079400 |
| MDVP:APQ | 195.0 | 0.024081 | 0.016947 | 0.007190 | 0.013080 | 0.018260 | 0.029400 | 0.137780 |
| Shimmer:DDA | 195.0 | 0.046993 | 0.030459 | 0.013640 | 0.024735 | 0.038360 | 0.060795 | 0.169420 |
| NHR | 195.0 | 0.024847 | 0.040418 | 0.000650 | 0.005925 | 0.011660 | 0.025640 | 0.314820 |
| HNR | 195.0 | 21.885974 | 4.425764 | 8.441000 | 19.198000 | 22.085000 | 25.075500 | 33.047000 |
| status | 195.0 | 0.753846 | 0.431878 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| RPDE | 195.0 | 0.498536 | 0.103942 | 0.256570 | 0.421306 | 0.495954 | 0.587562 | 0.685151 |
| DFA | 195.0 | 0.718099 | 0.055336 | 0.574282 | 0.674758 | 0.722254 | 0.761881 | 0.825288 |
| spread1 | 195.0 | -5.684397 | 1.090208 | -7.964984 | -6.450096 | -5.720868 | -5.046192 | -2.434031 |
| spread2 | 195.0 | 0.226510 | 0.083406 | 0.006274 | 0.174351 | 0.218885 | 0.279234 | 0.450493 |
| D2 | 195.0 | 2.381826 | 0.382799 | 1.423287 | 2.099125 | 2.361532 | 2.636456 | 3.671155 |
| PPE | 195.0 | 0.206552 | 0.090119 | 0.044539 | 0.137451 | 0.194052 | 0.252980 | 0.527367 |
The numerical attributes are summarised in the following manner:
i. MDVP:Fo(Hz): There are 195 records with a mean value of 154.23 Hz. The minimum and maximum frequency recorded by the individuals are 88.33 Hz and 260.11 Hz respectively. 25% of people have an average vocal fundamental frequency under 117.57 Hz, 50% of people have an average vocal fundamental frequency under 148.79 Hz whereas 75% of people have an average vocal fundamental frequency under 182.77 Hz. Also, the observations differ from the mean value by 41.39 Hz
ii. MDVP:Fhi(Hz): There are 195 records with a mean value of 197.10 Hz. The lowest and highest maximum vocal fundamental frequencies recorded are 102.15 Hz and 592.03 Hz respectively. For 25% of the observed people the value is under 134.86 Hz, for 50% of people it is under 175.83 Hz whereas for 75% of people it is under 224.21 Hz. Also, the observations differ from the mean value by 91.49 Hz
iii. MDVP:Flo(Hz): There are 195 records with a mean value of 116.32 Hz. The lowest and highest minimum vocal fundamental frequencies recorded are 65.48 Hz and 239.17 Hz respectively. For 25% of the observed people the value is under 84.29 Hz, for 50% of people it is under 104.32 Hz whereas for 75% of people it is under 140.02 Hz. Also, the observations differ from the mean value by 43.52 Hz
iv. MDVP:Jitter(%): There are 195 records with a mean value of 0.0062%. The minimum and maximum value recorded for the observed individuals are 0.00168% and 0.03316% respectively. For 25% of the observed people the value is under 0.00346%, for 50% of people it is under 0.00494% whereas for 75% of people it is under 0.007365%. Also, the observations differ from the mean value by 0.0049%
v. MDVP:Jitter(Abs): There are 195 records with a mean value of 0.000044. The minimum and maximum variability of the pitch within the analyzed voice sample for the observed individuals are 0.000007 and 0.00026 respectively. For 25% of the observed people the value is under 0.00002, for 50% of people it is under 0.00003 whereas for 75% of people it is under 0.00006. Also, the observations differ from the mean value by 0.000035
vi. MDVP:RAP: There are 195 records with a mean value of 0.003306. The minimum and maximum variability of the pitch within the analyzed voice sample with a smoothing factor (of 3 periods) for the observed individuals are 0.00068 and 0.021440 respectively. For 25% of the observed people the value is under 0.00166, for 50% of people it is under 0.0025 whereas for 75% of people it is under 0.003835. Also, the observations differ from the mean value by 0.002968
vii. MDVP:PPQ: There are 195 records with a mean value of 0.003446. The minimum and maximum variability of the pitch within the analyzed voice sample with a smoothing factor (of 5 periods) for the observed individuals are 0.00092 and 0.01958 respectively. For 25% of the observed people the value is under 0.00186, for 50% of people it is under 0.00269 whereas for 75% of people it is under 0.003955. Also, the observations differ from the mean value by 0.002759
viii. Jitter:DDP: There are 195 records with a mean value of 0.00992. For the observed persons it ranges from 0.00204 to 0.06433. For 25% of the observed people the value is under 0.004985, for 50% of people it is under 0.00749 whereas for 75% of people it is under 0.011505. Also, the observations differ from the mean value by 0.008903
ix. MDVP:Shimmer: There are 195 records with a mean value of 0.0297. For the observed persons it ranges from 0.00954 to 0.11908. For 25% of the observed people the value is under 0.016505, for 50% of people it is under 0.02297 whereas for 75% of people it is under 0.037885. Also, the observations differ from the mean value by 0.018857
x. MDVP:Shimmer(dB): There are 195 records with a mean value of 0.28 dB. For the observed persons it ranges from 0.085 dB to 1.302 dB. For 25% of the observed people the value is under 0.149 dB, for 50% of people it is under 0.221 dB whereas for 75% of people it is under 0.35 dB. Also, the observations differ from the mean value by 0.195 dB
xi. Shimmer:APQ3: There are 195 records with a mean value of 0.015664. For the observed persons it ranges from 0.00455 to 0.05647. For 25% of the observed people the value is under 0.008245, for 50% of people it is under 0.01279 whereas for 75% of people it is under 0.020265. Also, the observations differ from the mean value by 0.010153
xii. Shimmer:APQ5: There are 195 records with a mean value of 0.017878. For the observed persons it ranges from 0.0057 to 0.0794. For 25% of the observed people the value is under 0.00958, for 50% of people it is under 0.01347 whereas for 75% of people it is under 0.02238. Also, the observations differ from the mean value by 0.012024
xiii. MDVP:APQ: There are 195 records with a mean value of 0.024081. For the observed persons it ranges from 0.00719 to 0.13778. For 25% of the observed people the value is under 0.01308, for 50% of people it is under 0.01826 whereas for 75% of people it is under 0.0294. Also, the observations differ from the mean value by 0.016947
xiv. Shimmer:DDA: There are 195 records with a mean value of 0.046993. For the observed persons it ranges from 0.01364 to 0.16942. For 25% of the observed people the value is under 0.024735, for 50% of people it is under 0.03836 whereas for 75% of people it is under 0.060795. Also, the observations differ from the mean value by 0.030459
xv. NHR: There are 195 records with a mean value of 0.024847. For the observed persons it ranges from 0.00065 to 0.31482. For 25% of the observed people the value is under 0.00065, for 50% of people it is under 0.005925 whereas for 75% of people it is under 0.01166. Also, the observations differ from the mean value by 0.040418
xvi. HNR: There are 195 records with a mean value of 21.885974. For the observed persons it ranges from 8.441 to 33.047. For 25% of the observed people the value is under 19.198, for 50% of people it is under 22.085 whereas for 75% of people it is under 25.0755. Also, the observations differ from the mean value by 4.425764
xvii. status: The mean value of 0.75 makes it clear that the majority (about 75%) of the observed individuals have Parkinson's disease.
xviii. RPDE: There are 195 records with a mean value of 0.498536. For the observed persons it ranges from 0.25657 to 0.685151. For 25% of the observed people the value is under 0.421306, for 50% of people it is under 0.495954 whereas for 75% of people it is under 0.587562. Also, the observations differ from the mean value by 0.103942
xix. DFA: There are 195 records with a mean value of 0.718099. For the observed persons it ranges from 0.574282 to 0.825288. For 25% of the observed people the value is under 0.674758, for 50% of people it is under 0.722254 whereas for 75% of people it is under 0.761881. Also, the observations differ from the mean value by 0.055336
xx. spread1: There are 195 records with a mean value of -5.684397. For the observed persons it ranges from -7.964984 to -2.434031. For 25% of the observed people the value is under -6.450096, for 50% of people it is under -5.720868 whereas for 75% of people it is under -5.046192. Also, the observations differ from the mean value by 1.090208
xxi. spread2: There are 195 records with a mean value of 0.22651. For the observed persons it ranges from 0.006274 to 0.450493. For 25% of the observed people the value is under 0.174351, for 50% of people it is under 0.218885 whereas for 75% of people it is under 0.279234. Also, the observations differ from the mean value by 0.083406
xxii. D2: There are 195 records with a mean value of 2.381826. For the observed persons it ranges from 1.423287 to 3.671155. For 25% of the observed people the value is under 2.099125, for 50% of people it is under 2.361532 whereas for 75% of people it is under 2.636456. Also, the observations differ from the mean value by 0.382799
xxiii. PPE: There are 195 records with a mean value of 0.206552. For the observed persons it ranges from 0.044539 to 0.527367. For 25% of the observed people the value is under 0.137451, for 50% of people it is under 0.194052 whereas for 75% of people it is under 0.25298. Also, the observations differ from the mean value by 0.090119.
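The column-by-column skewness checks that follow can also be done in one pass with DataFrame.skew(); a minimal sketch on synthetic data (column names here are illustrative):

```python
import pandas as pd

# Synthetic columns: one perfectly symmetric, one with a long right tail.
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "right_tailed": [1, 1, 1, 1, 2, 2, 2, 3, 3, 30],
})

skews = df.skew()
# A symmetric column has skewness 0; a long right tail gives a large
# positive value, flagging columns that deserve a closer look.
print(skews["symmetric"])         # 0.0
print(skews["right_tailed"] > 1)  # True
```

On the real data, parkinson_data.skew().sort_values() gives the same overview for all 23 numeric attributes at once.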
# plotting of 'MDVP:Fo(Hz)':
sns.distplot(parkinson_data['MDVP:Fo(Hz)'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc62c54708>
From the above plot it seems that the curve is slightly positively skewed.
# measure of skewness of 'MDVP:Fo(Hz)':
parkinson_data['MDVP:Fo(Hz)'].skew()
0.5917374636540784
The curve being slightly positively skewed is ascertained here.
# presence of outliers in 'MDVP:Fo(Hz)':
sns.boxplot(parkinson_data['MDVP:Fo(Hz)'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc62fa7f88>
From the above plot it is clear that the attribute 'MDVP:Fo(Hz)' doesn't have any outliers.
# plotting of 'MDVP:Fhi(Hz)':
sns.distplot(parkinson_data['MDVP:Fhi(Hz)'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc6301b648>
From the plot it is clear that there are individuals with extreme values of maximum vocal fundamental frequency.
# measure of skewness of 'MDVP:Fhi(Hz)':
parkinson_data['MDVP:Fhi(Hz)'].skew()
2.542145997588398
The curve is highly positively skewed.
# presence of outliers in 'MDVP:Fhi(Hz)':
sns.boxplot(parkinson_data['MDVP:Fhi(Hz)'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc630ef148>
From the above plot it is clear that 'MDVP:Fhi(Hz)' does have outliers. The number of outliers can be calculated as follows:
fhi_25 = np.percentile(parkinson_data['MDVP:Fhi(Hz)'], 25)
fhi_75 = np.percentile(parkinson_data['MDVP:Fhi(Hz)'], 75)
iqr_fhi = fhi_75 - fhi_25
cutoff_fhi = 1.5 * iqr_fhi
low_lim_fhi = fhi_25 - cutoff_fhi
upp_lim_fhi = fhi_75 + cutoff_fhi
outlier_fhi = [x for x in parkinson_data['MDVP:Fhi(Hz)'] if x < low_lim_fhi or x > upp_lim_fhi]
print("The number of outliers in 'MDVP:Fhi(Hz)' out of 195 records is:", len(outlier_fhi))
The number of outliers in 'MDVP:Fhi(Hz)' out of 195 records is: 11
Thus, there are 11 values in 'MDVP:Fhi(Hz)' which are extreme compared to the other observations in the same attribute.
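The same IQR computation is repeated for every attribute below; it can be factored into a small helper (count_iqr_outliers is a name introduced here, not from the original notebook):

```python
import numpy as np

def count_iqr_outliers(values, k=1.5):
    """Count values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    low, high = q25 - k * iqr, q75 + k * iqr
    return sum(1 for x in values if x < low or x > high)

# Quick check on data with one obvious extreme value.
data = list(range(1, 11)) + [100]
print(count_iqr_outliers(data))  # 1
```

With this, each per-column block below reduces to a single call, e.g. count_iqr_outliers(parkinson_data['MDVP:Fhi(Hz)']).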
# plotting of 'MDVP:Flo(Hz)':
sns.distplot(parkinson_data['MDVP:Flo(Hz)'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63149188>
From the graph it is clear that the curve is skewed positively.
# measure of skewness of 'MDVP:Flo(Hz)':
parkinson_data['MDVP:Flo(Hz)'].skew()
1.217350448627808
The curve being positively skewed is ascertained here.
# presence of outliers in 'MDVP:Flo(Hz)':
sns.boxplot(parkinson_data['MDVP:Flo(Hz)'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc631eba48>
As seen from the above plot, there are some outliers in 'MDVP:Flo(Hz)'. The number of outliers can be calculated as:
flo_25 = np.percentile(parkinson_data['MDVP:Flo(Hz)'], 25)
flo_75 = np.percentile(parkinson_data['MDVP:Flo(Hz)'], 75)
iqr_flo = flo_75 - flo_25
cutoff_flo = 1.5 * iqr_flo
low_lim_flo = flo_25 - cutoff_flo
upp_lim_flo = flo_75 + cutoff_flo
outlier_flo = [x for x in parkinson_data['MDVP:Flo(Hz)'] if x < low_lim_flo or x > upp_lim_flo]
print("The number of outliers in 'MDVP:Flo(Hz)' out of 195 records is:", len(outlier_flo))
The number of outliers in 'MDVP:Flo(Hz)' out of 195 records is: 9
Thus, there are 9 values in 'MDVP:Flo(Hz)' which are extreme compared to the other observations in the same attribute.
# plotting of 'MDVP:Jitter(%)':
sns.distplot(parkinson_data['MDVP:Jitter(%)'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc632589c8>
The curve is a highly skewed one, thereby suggesting the presence of outliers.
# measure of skewness of 'MDVP:Jitter(%)':
parkinson_data['MDVP:Jitter(%)'].skew()
3.0849462014441817
The curve is highly positively skewed.
# presence of outliers in 'MDVP:Jitter(%)':
sns.boxplot(parkinson_data['MDVP:Jitter(%)'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63323808>
As seen from the above plot, 'MDVP:Jitter(%)' does have outliers. The number of outliers can be calculated as:
jit_25 = np.percentile(parkinson_data['MDVP:Jitter(%)'], 25)
jit_75 = np.percentile(parkinson_data['MDVP:Jitter(%)'], 75)
iqr_jit = jit_75 - jit_25
cutoff_jit = 1.5 * iqr_jit
low_lim_jit = jit_25 - cutoff_jit
upp_lim_jit = jit_75 + cutoff_jit
outlier_jit = [x for x in parkinson_data['MDVP:Jitter(%)'] if x < low_lim_jit or x > upp_lim_jit]
print("The number of outliers in 'MDVP:Jitter(%)' out of 195 records is:", len(outlier_jit))
The number of outliers in 'MDVP:Jitter(%)' out of 195 records is: 14
Thus, there are 14 values in 'MDVP:Jitter(%)' which are extreme compared to the other observations in the same attribute.
# plotting of 'MDVP:Jitter(Abs)':
sns.distplot(parkinson_data['MDVP:Jitter(Abs)'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63392588>
The curve is a highly skewed one, thereby suggesting the presence of outliers.
# measure of skewness of 'MDVP:Jitter(Abs)':
parkinson_data['MDVP:Jitter(Abs)'].skew()
2.6490714165257274
The curve is highly positively skewed.
# presence of outliers in 'MDVP:Jitter(Abs)':
sns.boxplot(parkinson_data['MDVP:Jitter(Abs)'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63451648>
From the above plot it is clear that outliers are present in 'MDVP:Jitter(Abs)'. The number of outliers can be calculated as:
jitab_25 = np.percentile(parkinson_data['MDVP:Jitter(Abs)'], 25)
jitab_75 = np.percentile(parkinson_data['MDVP:Jitter(Abs)'], 75)
iqr_jitab = jitab_75 - jitab_25
cutoff_jitab = 1.5 * iqr_jitab
low_lim_jitab = jitab_25 - cutoff_jitab
upp_lim_jitab = jitab_75 + cutoff_jitab
outlier_jitab = [x for x in parkinson_data['MDVP:Jitter(Abs)'] if x < low_lim_jitab or x > upp_lim_jitab]
print("The number of outliers in 'MDVP:Jitter(Abs)' out of 195 records is:", len(outlier_jitab))
The number of outliers in 'MDVP:Jitter(Abs)' out of 195 records is: 6
Thus, there are 6 values in 'MDVP:Jitter(Abs)' which are extreme compared to the other observations in the same attribute.
# plotting of 'MDVP:RAP':
sns.distplot(parkinson_data['MDVP:RAP'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc634c8d88>
The curve is a highly skewed one, thereby suggesting the presence of outliers.
# measure of skewness of 'MDVP:RAP':
parkinson_data['MDVP:RAP'].skew()
3.360708450480554
The curve is highly positively skewed.
# presence of outliers in 'MDVP:RAP':
sns.boxplot(parkinson_data['MDVP:RAP'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63591408>
From the above plot it is evident that 'MDVP:RAP' has outliers. The number of outliers can be calculated as:
rap_25 = np.percentile(parkinson_data['MDVP:RAP'], 25)
rap_75 = np.percentile(parkinson_data['MDVP:RAP'], 75)
iqr_rap = rap_75 - rap_25
cutoff_rap = 1.5 * iqr_rap
low_lim_rap = rap_25 - cutoff_rap
upp_lim_rap = rap_75 + cutoff_rap
outlier_rap = [x for x in parkinson_data['MDVP:RAP'] if x < low_lim_rap or x > upp_lim_rap]
print("The number of outliers in 'MDVP:RAP' out of 195 records is:", len(outlier_rap))
The number of outliers in 'MDVP:RAP' out of 195 records is: 14
Thus, there are 14 values in 'MDVP:RAP' which are extreme compared to the other observations in the same attribute.
# plotting of 'MDVP:PPQ':
sns.distplot(parkinson_data['MDVP:PPQ'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63602848>
The curve is a highly skewed one, thereby suggesting the presence of outliers.
# measure of skewness in 'MDVP:PPQ':
parkinson_data['MDVP:PPQ'].skew()
3.073892457888517
The curve is highly positively skewed.
# presence of outliers in 'MDVP:PPQ':
sns.boxplot(parkinson_data['MDVP:PPQ'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc6369e8c8>
From the above plot it is clear that 'MDVP:PPQ' has outliers. The number of outliers can be calculated as:
ppq_25 = np.percentile(parkinson_data['MDVP:PPQ'], 25)
ppq_75 = np.percentile(parkinson_data['MDVP:PPQ'], 75)
iqr_ppq = ppq_75 - ppq_25
cutoff_ppq = 1.5 * iqr_ppq
low_lim_ppq = ppq_25 - cutoff_ppq
upp_lim_ppq = ppq_75 + cutoff_ppq
outlier_ppq = [x for x in parkinson_data['MDVP:PPQ'] if x < low_lim_ppq or x > upp_lim_ppq]
print("The number of outliers in 'MDVP:PPQ' out of 195 records is:", len(outlier_ppq))
The number of outliers in 'MDVP:PPQ' out of 195 records is: 15
Thus, the attribute 'MDVP:PPQ' has 15 extreme values.
# plotting of 'Jitter:DDP':
sns.distplot(parkinson_data['Jitter:DDP'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63721a08>
The curve is a highly skewed one, thereby suggesting the presence of outliers.
# measure of skewness in 'Jitter:DDP':
parkinson_data['Jitter:DDP'].skew()
3.3620584478857203
The curve is highly positively skewed.
# presence of outliers in 'Jitter:DDP':
sns.boxplot(parkinson_data['Jitter:DDP'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc637e7688>
From the above plot it is clear that 'Jitter:DDP' has outliers. The number of outliers can be calculated as:
ddp_25 = np.percentile(parkinson_data['Jitter:DDP'], 25)
ddp_75 = np.percentile(parkinson_data['Jitter:DDP'], 75)
iqr_ddp = ddp_75 - ddp_25
cutoff_ddp = 1.5 * iqr_ddp
low_lim_ddp = ddp_25 - cutoff_ddp
upp_lim_ddp = ddp_75 + cutoff_ddp
outlier_ddp = [x for x in parkinson_data['Jitter:DDP'] if x < low_lim_ddp or x > upp_lim_ddp]
print("The number of outliers in 'Jitter:DDP' out of 195 records is:", len(outlier_ddp))
The number of outliers in 'Jitter:DDP' out of 195 records is: 14
Thus, in 'Jitter:DDP' there are 14 extreme values as compared to its other values.
# plotting of 'MDVP:Shimmer':
sns.distplot(parkinson_data['MDVP:Shimmer'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc63440fc8>
From the graph it is clear that it is positively skewed.
# measure of skewness in 'MDVP:Shimmer':
parkinson_data['MDVP:Shimmer'].skew()
1.6664804101559663
The curve being positively skewed is ascertained here.
# presence of outliers in 'MDVP:Shimmer':
sns.boxplot(parkinson_data['MDVP:Shimmer'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc6488d608>
From the above plot it is clear that the attribute 'MDVP:Shimmer' has outliers. The number of outliers can be calculated as:
shi_25 = np.percentile(parkinson_data['MDVP:Shimmer'], 25)
shi_75 = np.percentile(parkinson_data['MDVP:Shimmer'], 75)
iqr_shi = shi_75 - shi_25
cutoff_shi = 1.5 * iqr_shi
low_lim_shi = shi_25 - cutoff_shi
upp_lim_shi = shi_75 + cutoff_shi
outlier_shi = [x for x in parkinson_data['MDVP:Shimmer'] if x < low_lim_shi or x > upp_lim_shi]
print("The number of outliers in 'MDVP:Shimmer' out of 195 records is:", len(outlier_shi))
The number of outliers in 'MDVP:Shimmer' out of 195 records is: 8
Thus, 8 values in 'MDVP:Shimmer' are counted as extremes.
# plotting of 'MDVP:Shimmer(dB)':
sns.distplot(parkinson_data['MDVP:Shimmer(dB)'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64935548>
The curve is positively skewed.
# measure of skewness in 'MDVP:Shimmer(dB)':
parkinson_data['MDVP:Shimmer(dB)'].skew()
1.999388639086127
The curve being positively skewed is ascertained here.
# presence of outlier in 'MDVP:Shimmer(dB)':
sns.boxplot(parkinson_data['MDVP:Shimmer(dB)'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc649f1188>
'MDVP:Shimmer(dB)' has outliers. The number of outliers can be calculated as:
shidB_25 = np.percentile(parkinson_data['MDVP:Shimmer(dB)'], 25)
shidB_75 = np.percentile(parkinson_data['MDVP:Shimmer(dB)'], 75)
iqr_shidB = shidB_75 - shidB_25
cutoff_shidB = 1.5 * iqr_shidB
low_lim_shidB = shidB_25 - cutoff_shidB
upp_lim_shidB = shidB_75 + cutoff_shidB
outlier_shidB = [x for x in parkinson_data['MDVP:Shimmer(dB)'] if x < low_lim_shidB or x > upp_lim_shidB]
print("The number of outliers in 'MDVP:Shimmer(dB)' out of 195 records is:", len(outlier_shidB))
The number of outliers in 'MDVP:Shimmer(dB)' out of 195 records is: 10
Thus, 10 values in 'MDVP:Shimmer(dB)' are extreme.
# plotting of 'Shimmer:APQ3':
sns.distplot(parkinson_data['Shimmer:APQ3'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64a60f88>
From the above plot it appears that the attribute has a long right tail, suggesting the presence of outliers.
# measure of skewness in 'Shimmer:APQ3':
parkinson_data['Shimmer:APQ3'].skew()
1.5805763798815677
The curve is positively skewed.
# presence of outliers in 'Shimmer:APQ3':
sns.boxplot(parkinson_data['Shimmer:APQ3'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64b0c048>
The number of outliers in 'Shimmer:APQ3' can be calculated as:
apq3_25 = np.percentile(parkinson_data['Shimmer:APQ3'], 25)
apq3_75 = np.percentile(parkinson_data['Shimmer:APQ3'], 75)
iqr_apq3 = apq3_75 - apq3_25
cutoff_apq3 = 1.5 * iqr_apq3
low_lim_apq3 = apq3_25 - cutoff_apq3
upp_lim_apq3 = apq3_75 + cutoff_apq3
outlier_apq3 = [x for x in parkinson_data['Shimmer:APQ3'] if x < low_lim_apq3 or x > upp_lim_apq3]
print("The number of outliers in 'Shimmer:APQ3' out of 195 records is:", len(outlier_apq3))
The number of outliers in 'Shimmer:APQ3' out of 195 records is: 6
Thus, there are 6 values in 'Shimmer:APQ3' which are considered outliers.
# plotting of 'Shimmer:APQ5':
sns.distplot(parkinson_data['Shimmer:APQ5'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64b75ec8>
The curve is positively skewed, with outliers present.
# measure of skewness in 'Shimmer:APQ5':
parkinson_data['Shimmer:APQ5'].skew()
1.798697066537622
The curve is positively skewed here.
# presence of outliers in 'Shimmer:APQ5':
sns.boxplot(parkinson_data['Shimmer:APQ5'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64b06108>
The number of outliers in 'Shimmer:APQ5' is calculated as:
apq5_25 = np.percentile(parkinson_data['Shimmer:APQ5'], 25)
apq5_75 = np.percentile(parkinson_data['Shimmer:APQ5'], 75)
iqr_apq5 = apq5_75 - apq5_25
cutoff_apq5 = 1.5 * iqr_apq5
low_lim_apq5 = apq5_25 - cutoff_apq5
upp_lim_apq5 = apq5_75 + cutoff_apq5
outlier_apq5 = [x for x in parkinson_data['Shimmer:APQ5'] if x < low_lim_apq5 or x > upp_lim_apq5]
print("The number of outliers in 'Shimmer:APQ5' out of 195 records is:", len(outlier_apq5))
The number of outliers in 'Shimmer:APQ5' out of 195 records is: 13
Thus, there are 13 extreme values in 'Shimmer:APQ5'.
# plotting of 'MDVP:APQ':
sns.distplot(parkinson_data['MDVP:APQ'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64c8f688>
The curve is highly positively skewed, with outliers present.
# measure of skewness in 'MDVP:APQ':
parkinson_data['MDVP:APQ'].skew()
2.618046502215422
The curve is highly positively skewed.
# presence of outliers in 'MDVP:APQ':
sns.boxplot(parkinson_data['MDVP:APQ'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64d53448>
From the above plot it is clear that there are outliers in 'MDVP:APQ', whose count can be calculated as:
apq_25 = np.percentile(parkinson_data['MDVP:APQ'], 25)
apq_75 = np.percentile(parkinson_data['MDVP:APQ'], 75)
iqr_apq = apq_75 - apq_25
cutoff_apq = 1.5 * iqr_apq
low_lim_apq = apq_25 - cutoff_apq
upp_lim_apq = apq_75 + cutoff_apq
outlier_apq = [x for x in parkinson_data['MDVP:APQ'] if x < low_lim_apq or x > upp_lim_apq]
print("The number of outliers in 'MDVP:APQ' out of 195 records is:", len(outlier_apq))
The number of outliers in 'MDVP:APQ' out of 195 records is: 12
Thus, 12 values in 'MDVP:APQ' are considered extreme.
# plotting of 'Shimmer:DDA':
sns.distplot(parkinson_data['Shimmer:DDA'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64dc4b08>
The curve is positively skewed.
# measure of skewness in 'Shimmer:DDA':
parkinson_data['Shimmer:DDA'].skew()
1.5806179936782263
The positive skew is confirmed here.
# presence of outliers in 'Shimmer:DDA':
sns.boxplot(parkinson_data['Shimmer:DDA'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64e65a08>
The number of outliers in 'Shimmer:DDA' can be calculated as:
dda_25 = np.percentile(parkinson_data['Shimmer:DDA'], 25)
dda_75 = np.percentile(parkinson_data['Shimmer:DDA'], 75)
iqr_dda = dda_75 - dda_25
cutoff_dda = 1.5 * iqr_dda
low_lim_dda = dda_25 - cutoff_dda
upp_lim_dda = dda_75 + cutoff_dda
outlier_dda = [x for x in parkinson_data['Shimmer:DDA'] if x < low_lim_dda or x > upp_lim_dda]
print("The number of outliers in 'Shimmer:DDA' out of 195 records is:", len(outlier_dda))
The number of outliers in 'Shimmer:DDA' out of 195 records is: 6
Thus, there are 6 extreme values in 'Shimmer:DDA'.
# plotting of 'NHR':
sns.distplot(parkinson_data['NHR'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64ed2b88>
The curve is highly positively skewed with presence of outliers.
# measure of skewness in 'NHR':
parkinson_data['NHR'].skew()
4.22070912913906
The curve is highly positively skewed.
# presence of outliers in 'NHR':
sns.boxplot(parkinson_data['NHR'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64fdf588>
The number of outliers in 'NHR' can be calculated as:
nhr_25 = np.percentile(parkinson_data['NHR'], 25)
nhr_75 = np.percentile(parkinson_data['NHR'], 75)
iqr_nhr = nhr_75 - nhr_25
cutoff_nhr = 1.5 * iqr_nhr
low_lim_nhr = nhr_25 - cutoff_nhr
upp_lim_nhr = nhr_75 + cutoff_nhr
outlier_nhr = [x for x in parkinson_data['NHR'] if x < low_lim_nhr or x > upp_lim_nhr]
print("The number of outliers in 'NHR' out of 195 records is:", len(outlier_nhr))
The number of outliers in 'NHR' out of 195 records is: 19
Thus, 19 values in 'NHR' are considered outliers.
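Such heavy right skew (4.22 for 'NHR') is often tamed with a log transform before the data is fed to skew-sensitive models. A sketch with synthetic data — the lognormal sample below is a stand-in for a column like 'NHR', not the real data:

```python
import numpy as np

def sample_skew(a):
    """Fisher-Pearson skewness coefficient (close to pandas' .skew() for large n)."""
    a = np.asarray(a, dtype=float)
    z = (a - a.mean()) / a.std()
    return float(np.mean(z ** 3))

# Hypothetical right-skewed sample standing in for a column such as 'NHR'.
x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=195)
x_log = np.log1p(x)  # log(1 + x) stays defined even for zero-valued measurements

print(sample_skew(x) > sample_skew(x_log))  # the transform shrinks the skew
```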
# plotting of 'HNR':
sns.distplot(parkinson_data['HNR'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc64bbe4c8>
The curve is slightly negatively skewed.
# measure of skewness in 'HNR':
parkinson_data['HNR'].skew()
-0.5143174975652068
The negative skew is confirmed here.
# presence of outliers in 'HNR':
sns.boxplot(parkinson_data['HNR'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc650a4d08>
From the above plot it is clear that 'HNR' has outliers. Their number can be calculated as:
hnr_25 = np.percentile(parkinson_data['HNR'], 25)
hnr_75 = np.percentile(parkinson_data['HNR'], 75)
iqr_hnr = hnr_75 - hnr_25
cutoff_hnr = 1.5 * iqr_hnr
low_lim_hnr = hnr_25 - cutoff_hnr
upp_lim_hnr = hnr_75 + cutoff_hnr
outlier_hnr = [x for x in parkinson_data['HNR'] if x < low_lim_hnr or x > upp_lim_hnr]
print("The number of outliers in 'HNR' out of 195 records is:", len(outlier_hnr))
The number of outliers in 'HNR' out of 195 records is: 3
Thus, there are only 3 extreme values in 'HNR'.
# plotting of 'RPDE':
sns.distplot(parkinson_data['RPDE'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc65117108>
From the plot it seems that the attribute is almost normally distributed.
# measure of skewness in 'RPDE':
parkinson_data['RPDE'].skew()
-0.14340241379821705
The curve is slightly negatively skewed.
# presence of outliers in 'RPDE':
sns.boxplot(parkinson_data['RPDE'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc6513bc88>
There are no outliers in 'RPDE'.
# plotting of 'DFA':
sns.distplot(parkinson_data['DFA'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc65208648>
The attribute is almost normally distributed.
# measure of skewness in 'DFA':
parkinson_data['DFA'].skew()
-0.03321366071383484
The curve is slightly negatively skewed.
# presence of outliers in 'DFA':
sns.boxplot(parkinson_data['DFA'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc652bfb48>
From the plot it is clear that 'DFA' doesn't have any outliers.
# plotting of 'spread1':
sns.distplot(parkinson_data['spread1'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc6531d4c8>
From the above plot it seems that the attribute is almost normally distributed.
# measure of skewness in 'spread1':
parkinson_data['spread1'].skew()
0.4321389320131796
The curve is slightly positively skewed.
# presence of outliers in 'spread1':
sns.boxplot(parkinson_data['spread1'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc652bf8c8>
From the above plot it is clear that 'spread1' has outliers. Their number can be calculated as:
sp1_25 = np.percentile(parkinson_data['spread1'], 25)
sp1_75 = np.percentile(parkinson_data['spread1'], 75)
iqr_sp1 = sp1_75 - sp1_25
cutoff_sp1 = 1.5 * iqr_sp1
low_lim_sp1 = sp1_25 - cutoff_sp1
upp_lim_sp1 = sp1_75 + cutoff_sp1
outlier_sp1 = [x for x in parkinson_data['spread1'] if x < low_lim_sp1 or x > upp_lim_sp1]
print("The number of outliers in 'spread1' out of 195 records is:", len(outlier_sp1))
The number of outliers in 'spread1' out of 195 records is: 4
Thus, there are only 4 extreme values in 'spread1'.
# plotting of 'spread2':
sns.distplot(parkinson_data['spread2'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc6640e0c8>
The attribute is almost normally distributed.
# measure of skewness in 'spread2':
parkinson_data['spread2'].skew()
0.14443048549278412
The curve is slightly positively skewed.
# presence of outliers in 'spread2':
sns.boxplot(parkinson_data['spread2'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc664ae248>
From the plot it is clear that 'spread2' has outliers. Their number can be calculated as:
sp2_25 = np.percentile(parkinson_data['spread2'], 25)
sp2_75 = np.percentile(parkinson_data['spread2'], 75)
iqr_sp2 = sp2_75 - sp2_25
cutoff_sp2 = 1.5 * iqr_sp2
low_lim_sp2 = sp2_25 - cutoff_sp2
upp_lim_sp2 = sp2_75 + cutoff_sp2
outlier_sp2 = [x for x in parkinson_data['spread2'] if x < low_lim_sp2 or x > upp_lim_sp2]
print("The number of outliers in 'spread2' out of 195 records is:", len(outlier_sp2))
The number of outliers in 'spread2' out of 195 records is: 2
Thus, only 2 values in 'spread2' are extreme compared to the rest.
# plotting of 'D2':
sns.distplot(parkinson_data['D2'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc6651bd88>
The attribute is almost normally distributed.
# measure of skewness in 'D2':
parkinson_data['D2'].skew()
0.4303838913329283
The curve is slightly positively skewed.
# presence of outliers in 'D2':
sns.boxplot(parkinson_data['D2'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc60c04688>
From the above plot it is clear that 'D2' has outliers. Their number can be calculated as:
d2_25 = np.percentile(parkinson_data['D2'], 25)
d2_75 = np.percentile(parkinson_data['D2'], 75)
iqr_d2 = d2_75 - d2_25
cutoff_d2 = 1.5 * iqr_d2
low_lim_d2 = d2_25 - cutoff_d2
upp_lim_d2 = d2_75 + cutoff_d2
outlier_d2 = [x for x in parkinson_data['D2'] if x < low_lim_d2 or x > upp_lim_d2]
print("The number of outliers in 'D2' out of 195 records is:", len(outlier_d2))
The number of outliers in 'D2' out of 195 records is: 1
Thus, there is only one outlier in 'D2'.
# plotting of 'PPE':
sns.distplot(parkinson_data['PPE'], rug = True)
<matplotlib.axes._subplots.AxesSubplot at 0x1dc66613808>
The curve is slightly positively skewed.
# measure of skewness in 'PPE':
parkinson_data['PPE'].skew()
0.7974910716463578
The positive skew is confirmed here.
# presence of outliers in 'PPE':
sns.boxplot(parkinson_data['PPE'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc666bb908>
From the above plot it is clear that 'PPE' has outliers. Their number can be calculated as:
ppe_25 = np.percentile(parkinson_data['PPE'], 25)
ppe_75 = np.percentile(parkinson_data['PPE'], 75)
iqr_ppe = ppe_75 - ppe_25
cutoff_ppe = 1.5 * iqr_ppe
low_lim_ppe = ppe_25 - cutoff_ppe
upp_lim_ppe = ppe_75 + cutoff_ppe
outlier_ppe = [x for x in parkinson_data['PPE'] if x < low_lim_ppe or x > upp_lim_ppe]
print("The number of outliers in 'PPE' out of 195 records is:", len(outlier_ppe))
The number of outliers in 'PPE' out of 195 records is: 5
Thus, there are only 5 extreme values in 'PPE'.
# plotting of 'status':
sns.countplot(parkinson_data['status'])
<matplotlib.axes._subplots.AxesSubplot at 0x1dc66731788>
print("The number of recordings with Parkinson's is =", parkinson_data[parkinson_data['status'] == 1]['status'].count())
print("The number of recordings without Parkinson's is =", parkinson_data[parkinson_data['status'] == 0]['status'].count())
The number of recordings with Parkinson's is = 147
The number of recordings without Parkinson's is = 48
'status' is the target (dependent) variable. From the above plot it is clear that recordings of persons with Parkinson's disease far outnumber those without, a ratio of roughly 3:1. So we can expect the model to have a much better chance of predicting status = 1 than status = 0.
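One practical consequence of this imbalance: a classifier that always answers status = 1 already looks fairly accurate, so the accuracy figures later on should be read against this baseline. A quick sketch using the class counts from above:

```python
# Majority-class baseline for the imbalanced target: class counts taken
# from the dataset summary above (147 recordings with PD, 48 without).
n_pd, n_healthy = 147, 48
baseline_accuracy = max(n_pd, n_healthy) / (n_pd + n_healthy)
print(f"Always predicting status = 1 already scores {baseline_accuracy:.1%}")
```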
Here we will visualize how the different independent attributes vary with respect to the dependent attribute 'status'.
for i in parkinson_data:
    if i != 'status':
        sns.catplot(x = 'status', y = i, kind = 'box', data = parkinson_data)
From the boxplots it is clear that persons with lower values of 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)' and 'HNR' tend to be affected by Parkinson's.
Likewise, persons with higher values of 'MDVP:Jitter(%)', 'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP', 'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'RPDE', 'DFA', 'spread1', 'spread2', 'D2' and 'PPE' tend to be affected by Parkinson's.
This plot along with correlation matrix and heatmap will help us to analyze the relationship between the different attributes.
sns.pairplot(parkinson_data, hue = 'status')
<seaborn.axisgrid.PairGrid at 0x1dc682675c8>
# calculating the correlation coefficient
corr = parkinson_data.corr()
corr
| MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | MDVP:Shimmer(dB) | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MDVP:Fo(Hz) | 1.000000 | 0.400985 | 0.596546 | -0.118003 | -0.382027 | -0.076194 | -0.112165 | -0.076213 | -0.098374 | -0.073742 | ... | -0.094732 | -0.021981 | 0.059144 | -0.383535 | -0.383894 | -0.446013 | -0.413738 | -0.249450 | 0.177980 | -0.372356 |
| MDVP:Fhi(Hz) | 0.400985 | 1.000000 | 0.084951 | 0.102086 | -0.029198 | 0.097177 | 0.091126 | 0.097150 | 0.002281 | 0.043465 | ... | -0.003733 | 0.163766 | -0.024893 | -0.166136 | -0.112404 | -0.343097 | -0.076658 | -0.002954 | 0.176323 | -0.069543 |
| MDVP:Flo(Hz) | 0.596546 | 0.084951 | 1.000000 | -0.139919 | -0.277815 | -0.100519 | -0.095828 | -0.100488 | -0.144543 | -0.119089 | ... | -0.150737 | -0.108670 | 0.210851 | -0.380200 | -0.400143 | -0.050406 | -0.394857 | -0.243829 | -0.100629 | -0.340071 |
| MDVP:Jitter(%) | -0.118003 | 0.102086 | -0.139919 | 1.000000 | 0.935714 | 0.990276 | 0.974256 | 0.990276 | 0.769063 | 0.804289 | ... | 0.746635 | 0.906959 | -0.728165 | 0.278220 | 0.360673 | 0.098572 | 0.693577 | 0.385123 | 0.433434 | 0.721543 |
| MDVP:Jitter(Abs) | -0.382027 | -0.029198 | -0.277815 | 0.935714 | 1.000000 | 0.922911 | 0.897778 | 0.922913 | 0.703322 | 0.716601 | ... | 0.697170 | 0.834972 | -0.656810 | 0.338653 | 0.441839 | 0.175036 | 0.735779 | 0.388543 | 0.310694 | 0.748162 |
| MDVP:RAP | -0.076194 | 0.097177 | -0.100519 | 0.990276 | 0.922911 | 1.000000 | 0.957317 | 1.000000 | 0.759581 | 0.790652 | ... | 0.744919 | 0.919521 | -0.721543 | 0.266668 | 0.342140 | 0.064083 | 0.648328 | 0.324407 | 0.426605 | 0.670999 |
| MDVP:PPQ | -0.112165 | 0.091126 | -0.095828 | 0.974256 | 0.897778 | 0.957317 | 1.000000 | 0.957319 | 0.797826 | 0.839239 | ... | 0.763592 | 0.844604 | -0.731510 | 0.288698 | 0.333274 | 0.196301 | 0.716489 | 0.407605 | 0.412524 | 0.769647 |
| Jitter:DDP | -0.076213 | 0.097150 | -0.100488 | 0.990276 | 0.922913 | 1.000000 | 0.957319 | 1.000000 | 0.759555 | 0.790621 | ... | 0.744901 | 0.919548 | -0.721494 | 0.266646 | 0.342079 | 0.064026 | 0.648328 | 0.324377 | 0.426556 | 0.671005 |
| MDVP:Shimmer | -0.098374 | 0.002281 | -0.144543 | 0.769063 | 0.703322 | 0.759581 | 0.797826 | 0.759555 | 1.000000 | 0.987258 | ... | 0.987626 | 0.722194 | -0.835271 | 0.367430 | 0.447424 | 0.159954 | 0.654734 | 0.452025 | 0.507088 | 0.693771 |
| MDVP:Shimmer(dB) | -0.073742 | 0.043465 | -0.119089 | 0.804289 | 0.716601 | 0.790652 | 0.839239 | 0.790621 | 0.987258 | 1.000000 | ... | 0.963202 | 0.744477 | -0.827805 | 0.350697 | 0.410684 | 0.165157 | 0.652547 | 0.454314 | 0.512233 | 0.695058 |
| Shimmer:APQ3 | -0.094717 | -0.003743 | -0.150747 | 0.746625 | 0.697153 | 0.744912 | 0.763580 | 0.744894 | 0.987625 | 0.963198 | ... | 1.000000 | 0.716207 | -0.827123 | 0.347617 | 0.435242 | 0.151124 | 0.610967 | 0.402243 | 0.467265 | 0.645377 |
| Shimmer:APQ5 | -0.070682 | -0.009997 | -0.101095 | 0.725561 | 0.648961 | 0.709927 | 0.786780 | 0.709907 | 0.982835 | 0.973751 | ... | 0.960072 | 0.658080 | -0.813753 | 0.351148 | 0.399903 | 0.213873 | 0.646809 | 0.457195 | 0.502174 | 0.702456 |
| MDVP:APQ | -0.077774 | 0.004937 | -0.107293 | 0.758255 | 0.648793 | 0.737455 | 0.804139 | 0.737439 | 0.950083 | 0.960977 | ... | 0.896647 | 0.694019 | -0.800407 | 0.364316 | 0.451379 | 0.157276 | 0.673158 | 0.502188 | 0.536869 | 0.721694 |
| Shimmer:DDA | -0.094732 | -0.003733 | -0.150737 | 0.746635 | 0.697170 | 0.744919 | 0.763592 | 0.744901 | 0.987626 | 0.963202 | ... | 1.000000 | 0.716215 | -0.827130 | 0.347608 | 0.435237 | 0.151132 | 0.610971 | 0.402223 | 0.467261 | 0.645389 |
| NHR | -0.021981 | 0.163766 | -0.108670 | 0.906959 | 0.834972 | 0.919521 | 0.844604 | 0.919548 | 0.722194 | 0.744477 | ... | 0.716215 | 1.000000 | -0.714072 | 0.189429 | 0.370890 | -0.131882 | 0.540865 | 0.318099 | 0.470949 | 0.552591 |
| HNR | 0.059144 | -0.024893 | 0.210851 | -0.728165 | -0.656810 | -0.721543 | -0.731510 | -0.721494 | -0.835271 | -0.827805 | ... | -0.827130 | -0.714072 | 1.000000 | -0.361515 | -0.598736 | -0.008665 | -0.673210 | -0.431564 | -0.601401 | -0.692876 |
| status | -0.383535 | -0.166136 | -0.380200 | 0.278220 | 0.338653 | 0.266668 | 0.288698 | 0.266646 | 0.367430 | 0.350697 | ... | 0.347608 | 0.189429 | -0.361515 | 1.000000 | 0.308567 | 0.231739 | 0.564838 | 0.454842 | 0.340232 | 0.531039 |
| RPDE | -0.383894 | -0.112404 | -0.400143 | 0.360673 | 0.441839 | 0.342140 | 0.333274 | 0.342079 | 0.447424 | 0.410684 | ... | 0.435237 | 0.370890 | -0.598736 | 0.308567 | 1.000000 | -0.110950 | 0.591117 | 0.479905 | 0.236931 | 0.545886 |
| DFA | -0.446013 | -0.343097 | -0.050406 | 0.098572 | 0.175036 | 0.064083 | 0.196301 | 0.064026 | 0.159954 | 0.165157 | ... | 0.151132 | -0.131882 | -0.008665 | 0.231739 | -0.110950 | 1.000000 | 0.195668 | 0.166548 | -0.165381 | 0.270445 |
| spread1 | -0.413738 | -0.076658 | -0.394857 | 0.693577 | 0.735779 | 0.648328 | 0.716489 | 0.648328 | 0.654734 | 0.652547 | ... | 0.610971 | 0.540865 | -0.673210 | 0.564838 | 0.591117 | 0.195668 | 1.000000 | 0.652358 | 0.495123 | 0.962435 |
| spread2 | -0.249450 | -0.002954 | -0.243829 | 0.385123 | 0.388543 | 0.324407 | 0.407605 | 0.324377 | 0.452025 | 0.454314 | ... | 0.402223 | 0.318099 | -0.431564 | 0.454842 | 0.479905 | 0.166548 | 0.652358 | 1.000000 | 0.523532 | 0.644711 |
| D2 | 0.177980 | 0.176323 | -0.100629 | 0.433434 | 0.310694 | 0.426605 | 0.412524 | 0.426556 | 0.507088 | 0.512233 | ... | 0.467261 | 0.470949 | -0.601401 | 0.340232 | 0.236931 | -0.165381 | 0.495123 | 0.523532 | 1.000000 | 0.480585 |
| PPE | -0.372356 | -0.069543 | -0.340071 | 0.721543 | 0.748162 | 0.670999 | 0.769647 | 0.671005 | 0.693771 | 0.695058 | ... | 0.645389 | 0.552591 | -0.692876 | 0.531039 | 0.545886 | 0.270445 | 0.962435 | 0.644711 | 0.480585 | 1.000000 |
23 rows × 23 columns
# plotting a heatmap
plt.figure(figsize = (30,10))
ax = sns.heatmap(corr, annot = True, cmap = "ocean_r")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
(23.5, -0.5)
Thus, from the pairplot, the correlation matrix and the heatmap above we can see that MDVP:Jitter(%) has a correlation of about 0.99 with both MDVP:RAP and Jitter:DDP, and MDVP:Shimmer has a correlation of about 0.99 with MDVP:Shimmer(dB), Shimmer:APQ3 and Shimmer:DDA.
Since MDVP:Jitter(%) and MDVP:Shimmer are therefore nearly redundant with those attributes, we can drop both of them.
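Picking the near-duplicate columns by eye from a 23×23 matrix is error-prone; a small scan over the upper triangle does it programmatically. A sketch with a toy frame — `high_corr_pairs` is a name introduced here, not part of the notebook:

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.99):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy frame: 'a' and 'b' are perfectly collinear, 'c' is not.
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 1, 1]})
print(high_corr_pairs(demo))  # [('a', 'b', 1.0)]
```

Running it on `parkinson_data.drop('name', axis=1)` should surface the jitter and shimmer pairs identified above.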
# dropping 'MDVP:Jitter(%)' and 'MDVP:Shimmer' from the dataframe:
parkinson_data.drop(['MDVP:Jitter(%)', 'MDVP:Shimmer'], axis = 1, inplace = True)
parkinson_data.head().T
| name | phon_R01_S01_1 | phon_R01_S01_2 | phon_R01_S01_3 | phon_R01_S01_4 | phon_R01_S01_5 |
|---|---|---|---|---|---|
| MDVP:Fo(Hz) | 119.992000 | 122.400000 | 116.682000 | 116.676000 | 116.014000 |
| MDVP:Fhi(Hz) | 157.302000 | 148.650000 | 131.111000 | 137.871000 | 141.781000 |
| MDVP:Flo(Hz) | 74.997000 | 113.819000 | 111.555000 | 111.366000 | 110.655000 |
| MDVP:Jitter(Abs) | 0.000070 | 0.000080 | 0.000090 | 0.000090 | 0.000110 |
| MDVP:RAP | 0.003700 | 0.004650 | 0.005440 | 0.005020 | 0.006550 |
| MDVP:PPQ | 0.005540 | 0.006960 | 0.007810 | 0.006980 | 0.009080 |
| Jitter:DDP | 0.011090 | 0.013940 | 0.016330 | 0.015050 | 0.019660 |
| MDVP:Shimmer(dB) | 0.426000 | 0.626000 | 0.482000 | 0.517000 | 0.584000 |
| Shimmer:APQ3 | 0.021820 | 0.031340 | 0.027570 | 0.029240 | 0.034900 |
| Shimmer:APQ5 | 0.031300 | 0.045180 | 0.038580 | 0.040050 | 0.048250 |
| MDVP:APQ | 0.029710 | 0.043680 | 0.035900 | 0.037720 | 0.044650 |
| Shimmer:DDA | 0.065450 | 0.094030 | 0.082700 | 0.087710 | 0.104700 |
| NHR | 0.022110 | 0.019290 | 0.013090 | 0.013530 | 0.017670 |
| HNR | 21.033000 | 19.085000 | 20.651000 | 20.644000 | 19.649000 |
| status | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| RPDE | 0.414783 | 0.458359 | 0.429895 | 0.434969 | 0.417356 |
| DFA | 0.815285 | 0.819521 | 0.825288 | 0.819235 | 0.823484 |
| spread1 | -4.813031 | -4.075192 | -4.443179 | -4.117501 | -3.747787 |
| spread2 | 0.266482 | 0.335590 | 0.311173 | 0.334147 | 0.234513 |
| D2 | 2.301442 | 2.486855 | 2.342259 | 2.405554 | 2.332180 |
| PPE | 0.284654 | 0.368674 | 0.332634 | 0.368975 | 0.410335 |
X = parkinson_data.drop('status', axis = 1)
y = parkinson_data['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
Here, the independent variables are denoted by 'X' and the target (dependent) variable by 'y'.
We will also standardize the dataset:
# standardization: fit the scaler on the training set only, then apply the
# same transformation to the test set (fitting a second scaler on the test
# set would leak its statistics into the evaluation)
scaler = preprocessing.StandardScaler().fit(X_train)
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)
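A scaler must never be fit on the test split, since its statistics would then leak into evaluation; wrapping the scaler and the estimator in a `Pipeline` enforces the fit-on-train-only rule automatically. A sketch with synthetic data (not the real voice features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the voice features (not the real dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
pipe.fit(Xtr, ytr)           # scaler statistics are learned from Xtr only
print(pipe.score(Xte, yte))  # and reused, unchanged, on Xte
```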
LogReg_model = LogisticRegression(random_state = 1)
LogReg_model.fit(scaled_X_train, y_train)
LogisticRegression(random_state=1)
pred_log = LogReg_model.predict(scaled_X_test)
predictprob_log = LogReg_model.predict_proba(scaled_X_test)
pred_log
type(scaled_X_test)
numpy.ndarray
# print classification report and accuracy score:
print('Classification report for the model after scaling is given as:', '\n', classification_report(y_test, pred_log))
print('Accuracy obtained from the given model after scaling is:', accuracy_score(y_test, pred_log))
Classification report for the model after scaling is given as:
precision recall f1-score support
0 0.91 0.53 0.67 19
1 0.81 0.97 0.89 40
accuracy 0.83 59
macro avg 0.86 0.75 0.78 59
weighted avg 0.84 0.83 0.82 59
Accuracy obtained from the given model after scaling is: 0.8305084745762712
The accuracy obtained with the standardized data set is 83.05%.
# Confusion Matrix:
cm_log = confusion_matrix(y_test, pred_log)
class_label = ['Positive', 'Negative']
df_cm_log = pd.DataFrame(cm_log, index = class_label, columns = class_label)
ax = sns.heatmap(df_cm_log, annot = True, fmt = 'd')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
Thus, the Logistic Regression classifier correctly predicted 10 of the 19 positive records, misclassifying the other 9 as negative. It correctly identified 39 of the 40 negative records, misclassifying only 1 as positive.
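Reading the four cells off a heatmap is easy to get wrong — note that `confusion_matrix` orders rows by the sorted labels 0, 1, regardless of the 'Positive'/'Negative' captions placed on the plot. For binary labels, `.ravel()` unpacks the counts directly, sketched here with toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels using the same convention as 'status': 0 = healthy, 1 = PD.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat  = [0, 1, 0, 1, 1, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 2 1 1 4
```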
# Creating odd list of K for KNN
myList = list(range(1,20))
# subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))
# Empty list to hold accuracy scores
ac_scores_knn = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(scaled_X_train, y_train)
    y_pred = knn.predict(scaled_X_test)
    scores = accuracy_score(y_test, y_pred)
    ac_scores_knn.append(scores)
# convert accuracy to misclassification error (1 - accuracy)
error_knn = [1 - x for x in ac_scores_knn]
optimal_k = neighbors[error_knn.index(min(error_knn))]
print('The optimal number of neighbors is %d' % optimal_k)
plt.plot(neighbors, ac_scores_knn)
The optimal number of neighbors is 1
[<matplotlib.lines.Line2D at 0x1dc7d99ce48>]
So, here we will consider the value of k = 1.
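Note that k was tuned directly on test-set accuracy above, which lets the test data influence model selection; cross-validating on the training split is the safer route. A sketch with synthetic data standing in for the scaled training split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training split (not the real dataset).
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=1)

# 5-fold cross-validated accuracy for each odd k, using training data only
cv_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X_demo, y_demo, cv=5).mean()
             for k in range(1, 20, 2)}
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```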
knn_model = KNeighborsClassifier(n_neighbors = 1, weights = 'uniform', metric = 'euclidean')
knn_model.fit(scaled_X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
metric_params=None, n_jobs=None, n_neighbors=1, p=2,
weights='uniform')
pred_knn = knn_model.predict(scaled_X_test)
predictprob_knn = knn_model.predict_proba(scaled_X_test)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, pred_knn))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, pred_knn))
Classification report for the model is given as:
precision recall f1-score support
0 0.89 0.84 0.86 19
1 0.93 0.95 0.94 40
accuracy 0.92 59
macro avg 0.91 0.90 0.90 59
weighted avg 0.91 0.92 0.91 59
Accuracy obtained from the given model is: 0.9152542372881356
Thus, the accuracy obtained from the model based on KNN is 91.52%.
# Confusion Matrix:
cm_knn = confusion_matrix(y_test, pred_knn)
class_label = ['Positive', 'Negative']
df_cm_knn = pd.DataFrame(cm_knn, index = class_label, columns = class_label)
ax = sns.heatmap(df_cm_knn, annot = True, fmt = 'd')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
From the above confusion matrix it is clear that the model correctly predicted 16 of the 19 positive records, misclassifying 3 of them as negative. It also correctly identified 38 of the 40 negative records, misclassifying the other 2 as positive.
naive_model = GaussianNB()
naive_model.fit(scaled_X_train, y_train)
GaussianNB(priors=None, var_smoothing=1e-09)
pred_nb = naive_model.predict(scaled_X_test)
predictprob_nb = naive_model.predict_proba(scaled_X_test)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, pred_nb))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, pred_nb))
Classification report for the model is given as:
precision recall f1-score support
0 0.50 0.68 0.58 19
1 0.82 0.68 0.74 40
accuracy 0.68 59
macro avg 0.66 0.68 0.66 59
weighted avg 0.72 0.68 0.69 59
Accuracy obtained from the given model is: 0.6779661016949152
Thus, the accuracy obtained from the model based on Naive-Bayes classifier is 67.8%.
# Confusion Matrix:
cm_nb = confusion_matrix(y_test, pred_nb)
class_label = ['Positive', 'Negative']
df_cm_nb = pd.DataFrame(cm_nb, index = class_label, columns = class_label)
ax = sns.heatmap(df_cm_nb, annot = True, fmt = 'd')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
From the above confusion matrix it is clear that the model correctly predicted 13 of the 19 positive records, misclassifying 6 of them as negative. It correctly identified only 27 of the 40 negative records, misclassifying the other 13 as positive.
C_range = 10. ** np.arange(-3, 8)
gamma_range = 10. ** np.arange(-5, 4)
param_grid = dict(gamma = gamma_range, C = C_range)
grid = GridSearchCV(SVC(class_weight = 'balanced'), param_grid = param_grid, cv = StratifiedKFold(n_splits = 5))
grid.fit(scaled_X_train, y_train)
print("The best classifier is: ", grid.best_estimator_)
score_dict = grid.cv_results_
scores = score_dict.get('mean_test_score')
scores = np.array(scores).reshape(len(C_range), len(gamma_range))
plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=0.15, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest')
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(C_range)), C_range)
plt.show()
The best classifier is: SVC(C=10.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
svm_model = SVC(C = 10.0, gamma = 0.1, class_weight = 'balanced', probability = True, random_state = 1)
svm_model.fit(scaled_X_train, y_train)
SVC(C=10.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
max_iter=-1, probability=True, random_state=1, shrinking=True, tol=0.001,
verbose=False)
pred_svm = svm_model.predict(scaled_X_test)
predictprob_svm = svm_model.predict_proba(scaled_X_test)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, pred_svm))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, pred_svm))
Classification report for the model is given as:
precision recall f1-score support
0 1.00 0.79 0.88 19
1 0.91 1.00 0.95 40
accuracy 0.93 59
macro avg 0.95 0.89 0.92 59
weighted avg 0.94 0.93 0.93 59
Accuracy obtained from the given model is: 0.9322033898305084
Thus, the accuracy obtained from the model based on SVM is 93.22%.
# Confusion Matrix:
cm_svm = confusion_matrix(y_test, pred_svm)
class_label = ['Positive', 'Negative']
df_cm_svm = pd.DataFrame(cm_svm, index = class_label, columns = class_label)
ax = sns.heatmap(df_cm_svm, annot = True, fmt = 'd')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
From the above confusion matrix it is clear that the model correctly predicted 15 of the 19 positive records, misclassifying 4 of them as negative. It correctly identified all 40 negative records and did not predict any negative record as positive.
Here a KNeighborsClassifier, a Support Vector Classifier (SVC) and a Naive Bayes classifier (GaussianNB) will be trained individually, and the performance of each will be measured by its accuracy score. Finally, we will stack the predictions of these classifiers using mlxtend's StackingClassifier with a Logistic Regression classifier as the meta-classifier, and compare the results.
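The `StackingClassifier(classifiers=..., meta_classifier=...)` API used below comes from mlxtend; scikit-learn (0.22+) ships an equivalent `sklearn.ensemble.StackingClassifier`, sketched here with synthetic data rather than the real voice features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)

stack = StackingClassifier(
    estimators=[('knn', KNeighborsClassifier(n_neighbors=1)),
                ('nb', GaussianNB())],
    final_estimator=LogisticRegression(random_state=1))
stack.fit(Xtr, ytr)
print(round(stack.score(Xte, yte), 3))
```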
knn_stack = KNeighborsClassifier(n_neighbors = 1, weights = 'uniform', metric = 'euclidean')
svm_stack = SVC(C = 10.0, gamma = 0.1, class_weight = 'balanced', probability = True, random_state = 1)
naive_stack = GaussianNB()
logreg_stack = LogisticRegression(random_state = 1)
sclf = StackingClassifier(classifiers = [knn_stack, svm_stack, naive_stack], meta_classifier = logreg_stack)
classifiers = {'KNN': knn_stack, 'SVM': svm_stack, 'Naive Bayes': naive_stack, 'Stack': sclf}
for key in classifiers:
    classifier = classifiers[key]
    classifier.fit(scaled_X_train, y_train)
results = pd.DataFrame()
for key in classifiers:
    # Make predictions on the test set
    y_pred = classifiers[key].predict(scaled_X_test)
    # Save the results in a pandas DataFrame
    results[f"{key}"] = y_pred
# Add the test set to the results object (use .values so alignment is by
# row order rather than by the original scattered index of y_test)
results["Target"] = y_test.values
# Probability Distributions Figure
# Set graph style
sns.set(font_scale = 1)
sns.set_style({"axes.facecolor": "1.0", "axes.edgecolor": "0.85", "grid.color": "0.85",
"grid.linestyle": "-", 'axes.labelcolor': '0.4', "xtick.color": "0.4",
'ytick.color': '0.4'})
# Plot
f, ax = plt.subplots(figsize=(13, 4), nrows=1, ncols = 4)
for key, counter in zip(classifiers, range(4)):
    # Get predictions
    y_pred = results[key]
    # Get accuracy score
    acc = accuracy_score(y_test, y_pred)
    textstr = f"Accuracy Score: {acc:.3f}"
    # Plot false distribution
    false_pred = results[results["Target"] == 0]
    sns.distplot(false_pred[key], hist=True, kde=False,
                 bins=int(10), color='red',
                 hist_kws={'edgecolor': 'black'}, ax=ax[counter])
    # Plot true distribution
    true_pred = results[results["Target"] == 1]
    sns.distplot(true_pred[key], hist=True, kde=False,
                 bins=int(10), color='green',
                 hist_kws={'edgecolor': 'black'}, ax=ax[counter])
    # These are matplotlib.patch.Patch properties
    props = dict(boxstyle='round', facecolor='white', alpha=0.5)
    # Place a text box in the upper left in axes coords
    ax[counter].text(0.05, 0.95, textstr, transform=ax[counter].transAxes, fontsize=14,
                     verticalalignment="top", bbox=props)
    # Set axis limits and labels
    ax[counter].set_title(f"{key} Distribution")
    ax[counter].set_xlim(0, 1)
    ax[counter].set_xlabel("Probability")
# Tight layout
plt.tight_layout()
# Save figure
plt.savefig("Probability Distribution for each Classifier.png", dpi = 1080)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, y_pred))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, y_pred))
Classification report for the model is given as:
precision recall f1-score support
0 1.00 0.74 0.85 19
1 0.89 1.00 0.94 40
accuracy 0.92 59
macro avg 0.94 0.87 0.89 59
weighted avg 0.92 0.92 0.91 59
Accuracy obtained from the given model is: 0.9152542372881356
Thus, even though we stacked the models, the test-set accuracy of the StackingClassifier (91.5%) is lower than that obtained by the Support Vector Machine classifier alone.
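A single train/test split can flatter or penalise a model by chance, so before concluding that stacking underperforms, it can help to compare models with cross-validation. A minimal sketch, using synthetic data of the same shape as this dataset (195 samples, 22 features) as a stand-in for the real features:

```python
# Hedged sketch: compare a scaled SVM with 5-fold cross-validation.
# Synthetic data stands in for the Parkinson's voice features; the real
# notebook would pass its own X and y instead.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=195, n_features=22, random_state=1)

# Pipeline keeps the scaler inside each CV fold, avoiding leakage
svm = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(svm, X, y, cv=5)
print(f"SVM 5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean and standard deviation across folds give a fairer basis for ranking the stack against its base learners than one split.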
# Confusion Matrix:
cm_sclf = confusion_matrix(y_test, y_pred)
class_label = ['Positive', 'Negative']
df_cm_sclf = pd.DataFrame(cm_sclf, index = class_label, columns = class_label)
ax = sns.heatmap(df_cm_sclf, annot = True, fmt = 'd')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
From the above confusion matrix it is clear that the model correctly predicted 14 of the 19 Positive (healthy) records and misclassified the other 5, while it correctly identified all 40 Negative (Parkinson's) records; it did not predict any negative record as positive.
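The four counts read off the heatmap can also be unpacked directly, since `confusion_matrix` orders rows by true label and columns by predicted label. A small illustration with toy labels (not the notebook's actual predictions):

```python
# Sketch: unpack TN/FP/FN/TP from sklearn's confusion matrix.
# Rows are true labels, columns are predicted labels, so ravel()
# yields the cells in the order tn, fp, fn, tp for binary targets.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1]   # toy ground truth
y_hat  = [0, 1, 0, 1, 1, 1, 0]   # toy predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # -> 2 1 1 3
```

Unpacking the cells this way avoids miscounting when the class labels on the heatmap axes are renamed, as they are here.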
# Decision Tree with criterion = 'gini':
dTree_model = DecisionTreeClassifier(random_state = 1)
dTree_model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1, splitter='best')
pred_dTree = dTree_model.predict(X_test)
predictprob_dTree = dTree_model.predict_proba(X_test)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, pred_dTree))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, pred_dTree))
Classification report for the model is given as:
precision recall f1-score support
0 0.93 0.74 0.82 19
1 0.89 0.97 0.93 40
accuracy 0.90 59
macro avg 0.91 0.86 0.88 59
weighted avg 0.90 0.90 0.89 59
Accuracy obtained from the given model is: 0.8983050847457628
Thus, the accuracy obtained when the Decision Tree classifier is used with criterion = 'gini' is 89.83%.
# Confusion Matrix:
cm_dTree = confusion_matrix(y_test, pred_dTree)
class_label = ['Positive', 'Negative']
df_cm_dTree = pd.DataFrame(cm_dTree, index = class_label, columns = class_label)
ax = sns.heatmap(df_cm_dTree, annot = True, fmt = 'd')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
From the above confusion matrix it is clear that the model correctly predicted 14 of the 19 Positive (healthy) records and misclassified the other 5, while it correctly identified 39 of the 40 Negative (Parkinson's) records, predicting 1 negative record as positive.
train_char_label = ['No', 'Yes']
PD_Tree_File = open('parkinson_tree.dot','w')
dot_data = tree.export_graphviz(dTree_model, out_file = PD_Tree_File, feature_names = list(X_train), class_names = list(train_char_label))
PD_Tree_File.close()
retCode = system("dot -Tpng parkinson_tree.dot -o parkinson_tree.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("parkinson_tree.png"))
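Beyond visualising the tree, a fitted `DecisionTreeClassifier` exposes Gini-based feature importances, which indicate which voice measures drive the splits. A minimal sketch, again using synthetic stand-in data with hypothetical `feat_i` column names:

```python
# Hedged sketch: rank features by the tree's impurity-based importances.
# Synthetic data and the feat_i names are illustrative stand-ins for the
# notebook's real X_train columns.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=195, n_features=22, random_state=1)
cols = [f"feat_{i}" for i in range(X.shape[1])]

tree_clf = DecisionTreeClassifier(random_state=1).fit(X, y)
# Importances sum to 1.0 over all features used by the tree
importances = pd.Series(tree_clf.feature_importances_, index=cols)
print(importances.sort_values(ascending=False).head())
```

On the real data this ranking would show which acoustic measurements the tree relies on most, complementing the exported diagram.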
Without pruning, the Decision Tree achieved a test accuracy of 89.83%. Let us now try to prune it by varying the 'max_depth' argument:
max_depth = []
acc_gini = []
for i in range(1, 10):
    modelR = DecisionTreeClassifier(criterion='gini', random_state=1, max_depth=i)
    modelR.fit(X_train, y_train)
    pred = modelR.predict(X_test)
    acc_gini.append(accuracy_score(y_test, pred))
    max_depth.append(i)
d = pd.DataFrame({'acc_gini':pd.Series(acc_gini), 'max_depth':pd.Series(max_depth)})
plt.plot('max_depth', 'acc_gini', data = d, label = 'gini')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
Here we can see that the accuracy plateaus after max_depth = 5, so let us fit the Decision Tree model with max_depth = 5:
dTree_modelR = DecisionTreeClassifier(max_depth = 5, random_state = 1)
dTree_modelR.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1, splitter='best')
pred_dTreeR = dTree_modelR.predict(X_test)
predictprob_dTreeR = dTree_modelR.predict_proba(X_test)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, pred_dTreeR))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, pred_dTreeR))
Classification report for the model is given as:
precision recall f1-score support
0 0.93 0.74 0.82 19
1 0.89 0.97 0.93 40
accuracy 0.90 59
macro avg 0.91 0.86 0.88 59
weighted avg 0.90 0.90 0.89 59
Accuracy obtained from the given model is: 0.8983050847457628
Thus, we can see that even after pruning the model, the accuracy remains the same.
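Depth-limiting is only one way to prune. Scikit-learn also supports cost-complexity pruning via the `ccp_alpha` parameter, and `cost_complexity_pruning_path` enumerates the candidate alpha values for a given training set. A hedged sketch with synthetic stand-in data, mirroring the notebook's approach of scoring candidates on the test split:

```python
# Hedged sketch: cost-complexity pruning as an alternative to max_depth.
# Synthetic data replaces the real train/test split; the selection-on-test
# shortcut mirrors the max_depth loop above (a validation set or CV would
# be more rigorous).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=195, n_features=22, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Candidate alphas: each one prunes away the weakest link of the full tree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=1, ccp_alpha=a)
    .fit(X_tr, y_tr)
    .score(X_te, y_te),
)
print(f"best ccp_alpha: {best:.4f}")
```

Unlike max_depth, which cuts the tree uniformly, ccp_alpha removes individual subtrees whose complexity is not justified by their impurity reduction.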
bag_model = BaggingClassifier(base_estimator = dTree_model, random_state = 1)
bag_model.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=1,
splitter='best'),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=1, verbose=0, warm_start=False)
pred_bag = bag_model.predict(X_test)
predictprob_bag = bag_model.predict_proba(X_test)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, pred_bag))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, pred_bag))
Classification report for the model is given as:
precision recall f1-score support
0 0.80 0.63 0.71 19
1 0.84 0.93 0.88 40
accuracy 0.83 59
macro avg 0.82 0.78 0.79 59
weighted avg 0.83 0.83 0.82 59
Accuracy obtained from the given model is: 0.8305084745762712
Thus, the accuracy obtained in this case is 83.05%.
num_est = np.arange(1,200)
for n_est in num_est:
    bag_clf = BaggingClassifier(base_estimator=dTree_model, n_estimators=n_est, random_state=1)
    bag_clf = bag_clf.fit(X_train, y_train)
    pred = bag_clf.predict(X_test)
    print(n_est, " ", accuracy_score(y_test, pred))
1    0.8136
2    0.7288
3    0.8136
4    0.8475
...
16   0.8814
...
50   0.8814
...
199  0.8814
(output truncated: accuracy fluctuates between roughly 0.73 and 0.88 for small ensembles and settles at 0.8814 from about 50 estimators onward)
From the output above we can see that the accuracy reaches a plateau after about 50 base estimators, so let us refit with n_estimators = 50:
bag_modelR = BaggingClassifier(base_estimator= dTree_model, n_estimators= 50, random_state = 1)
bag_modelR.fit(X_train, y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=1,
splitter='best'),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=50, n_jobs=None,
oob_score=False, random_state=1, verbose=0, warm_start=False)
pred_bagR = bag_modelR.predict(X_test)
predictprob_bagR = bag_modelR.predict_proba(X_test)
# print classification report and accuracy score:
print('Classification report for the model is given as:', '\n', classification_report(y_test, pred_bagR))
print('Accuracy obtained from the given model is:', accuracy_score(y_test, pred_bagR))
Classification report for the model is given as:
precision recall f1-score support
0 0.93 0.68 0.79 19
1 0.87 0.97 0.92 40
accuracy 0.88 59
macro avg 0.90 0.83 0.85 59
weighted avg 0.89 0.88 0.88 59
Accuracy obtained from the given model is: 0.8813559322033898
Thus, on increasing the ensemble size from 10 to 50, there has been a considerable increase in the accuracy of the model, from 83.05% to 88.14%.
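Sweeping n_estimators against the test set, as above, risks tuning to that particular split. Bagging offers a built-in alternative: with `oob_score=True`, each tree is evaluated on the samples left out of its bootstrap, giving a generalization estimate without touching the test set. A minimal sketch with synthetic stand-in data (the default base estimator of `BaggingClassifier` is already a decision tree):

```python
# Hedged sketch: use out-of-bag accuracy to compare ensemble sizes.
# Synthetic data stands in for the real training set; BaggingClassifier's
# default base estimator is a DecisionTreeClassifier, matching the text.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=195, n_features=22, random_state=1)

for n in (10, 50):
    bag = BaggingClassifier(
        n_estimators=n, oob_score=True, random_state=1
    ).fit(X, y)
    # oob_score_ is the accuracy on out-of-bag samples
    print(n, round(bag.oob_score_, 3))
```

Choosing n_estimators by OOB score keeps the test set untouched for the final, unbiased evaluation.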
# Confusion Matrix:
cm_bag = confusion_matrix(y_test, pred_bagR)
class_label = ['Positive', 'Negative']
df_cm_bag = pd.DataFrame(cm_bag, index = class_label, columns = class_label)
ax = sns.heatmap(df_cm_bag, annot = True, fmt = 'd')
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
From the above confusion matrix it is clear that the model correctly predicted 13 of the 19 Positive (healthy) records and misclassified the other 6, while it correctly identified 39 of the 40 Negative (Parkinson's) records, predicting 1 negative record as positive.
Before deciding which model (Logistic Regression, KNN, Naive Bayes, Support Vector Machine, StackingClassifier or BaggingClassifier) is best, let us summarise the results from each. The test set contained 19 Positive records (individuals not affected by Parkinson's) and 40 Negative records (individuals affected by Parkinson's disease); the six models differ in their counts of true positives, true negatives, false positives and false negatives.
Logistic Regression:
This algorithm provided an accuracy of 86.44%. Of the 19 test records not affected by Parkinson's, it correctly predicted 11; of the 40 test records affected by Parkinson's, it correctly predicted all 40.
KNN (K-Nearest Neighbor):
This algorithm provided an accuracy of 91.52%. Of the 19 test records not affected by Parkinson's, it correctly predicted 16; of the 40 affected records, it correctly predicted 38, missing 2 people who were actually suffering from Parkinson's.
Naive Bayes:
This algorithm provided an accuracy of 67.8%. Of the 19 test records not affected by Parkinson's, it correctly predicted only 13; of the 40 affected records, it correctly predicted only 27, missing 13 people who were actually suffering from Parkinson's.
Support Vector Machine (SVC):
This algorithm provided an accuracy of 93.22%. Of the 19 test records not affected by Parkinson's, it correctly predicted 15, and it correctly predicted all 40 affected records.
StackingClassifier:
This algorithm provided an accuracy of 91.5%. Of the 19 test records not affected by Parkinson's, it correctly predicted 14, and it correctly predicted all 40 affected records.
Bagging Classifier:
This algorithm provided an accuracy of 88.14%. Of the 19 test records not affected by Parkinson's, it correctly predicted 13; of the 40 affected records, it correctly predicted 39, missing just 1 person who was actually suffering from Parkinson's.
Thus, we can see that the Support Vector Machine (SVC) has the best accuracy among the six algorithms used here. It also correctly identified the largest number of people suffering from Parkinson's (along with the StackingClassifier and Logistic Regression).
Thus, in this case we can say that the Support Vector Machine (SVC) is the best model of the six.
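The verbal comparison above can be condensed into a single ranking table. This sketch simply collects the accuracies quoted in the text; the numbers are copied from the notebook's own results, not recomputed:

```python
# Sketch: assemble the six reported accuracies into one sortable table
# so the final comparison is explicit. Values are taken from the text.
import pandas as pd

summary = pd.DataFrame(
    {"accuracy": [0.8644, 0.9152, 0.6780, 0.9322, 0.9152, 0.8814]},
    index=["Logistic Regression", "KNN", "Naive Bayes", "SVM",
           "Stacking", "Bagging"],
)
print(summary.sort_values("accuracy", ascending=False))
```

Sorting the table confirms the conclusion: SVM leads, with KNN and the stack tied just behind it.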